Univariate Plots

PH345: Winter 2025

Phil Boonstra

Univariate Plots

Univariate plots are used to visualize the distribution of a single variable.

Examples

  • Histograms
  • Boxplots
  • Barcharts

Ultra-runner data (Samtleben, 2023)

\(n = 288\) ultra-runners (completing 100 km ultra-marathons)

Each runner’s personal best (in hours):

  [1] 14.00  7.60 14.20 14.33 17.00 12.00 16.00 16.16  9.95 17.55 12.50 23.00
 [13] 18.85  8.50 19.33 16.00 18.00 12.00 14.38 15.00 19.25 14.00 14.21 14.91
 [25] 14.50 19.00 18.50 15.00 20.00 12.16 14.82 12.99 13.50 12.98  9.20 10.00
 [37] 13.55 14.00 15.00 14.00 22.00 15.33 15.53 12.26 12.00 13.00 10.77 13.98
 [49] 14.00 16.00 12.05 20.83 14.00 14.30 14.78 13.83 16.00 12.90 19.67 14.00
 [61] 16.67 10.25 15.38 13.35 14.00 22.00  7.15 14.00 12.00 16.00  9.50 15.13
 [73] 12.99 18.77 15.00 11.25 14.00 13.00 14.53 18.75 16.00 14.50 18.66 15.50
 [85] 12.77  9.05 16.30 17.00 22.00  9.50 15.46  8.70 16.75 12.00 14.41 10.50
 [97] 17.00 11.17 15.50 17.00 13.86 20.00 10.45 10.34 13.33 14.50  7.90 11.00
[109] 10.71 12.00 15.36 19.41 14.00  9.00 15.16 12.00 18.81 10.50 12.00 14.00
[121] 15.00  9.00 20.00 21.50 11.33 15.00 21.25 23.00 22.00 18.60 21.90 16.16
[133] 15.50 13.71 23.50 10.33  8.70 18.00 12.83 10.49 13.33 14.86 19.99 15.66
[145] 22.36 22.40 16.00 16.52 11.25 13.06  9.60 14.25 20.00 20.00 13.75 10.34
[157] 12.25 13.25 12.00 10.95 16.75 13.25 14.00 13.65 18.00 18.00 15.00 12.70
[169] 17.50 19.66 11.51 12.71 12.00 17.00 13.00  6.50 19.00 19.70 14.25  9.86
[181] 23.00 15.33 14.65 15.60 22.00 14.00 14.00 16.86 14.51 13.51 13.75 18.51
[193] 19.75 20.80 15.99 16.34 25.00 13.00 16.88 12.95 11.50 12.75 11.16 12.70
[205] 10.13 17.01 11.24 12.60 20.00 14.01 13.05 13.18 12.00 12.00 15.38 15.00
[217] 10.52 15.16  9.90 13.50 21.68 20.00 19.00 12.00 14.91 11.00 14.36 11.00
[229] 17.00 11.99 12.46 20.00 15.01 12.41 13.49 14.00 13.20 13.55 13.96 10.95
[241] 16.00 11.80 17.00 11.65 13.58 13.09 13.86 16.00 15.00 12.08 14.16 11.00
[253] 18.00 12.85 22.00 11.50 14.66 10.16 13.00  7.50 19.84 16.75 12.00 25.25
[265] 15.50 13.36 10.00 17.00 12.83 16.00 12.50 16.00  9.18 16.50 14.41 14.25
[277] 19.00 15.00 13.36 17.83 10.50 11.75 12.75 19.75 15.40 21.00 18.00 14.46

Creating a histogram

  1. Choose a bin width and a center value. For example, one-hour bins centered at the integers are \((5.5, 6.5]\), \((6.5, 7.5]\), \((7.5, 8.5]\), etc. Bins must not overlap, and together they must cover the full range of the data.

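  2. Assign each runner to a bin. With one-hour bins centered at the integers, 12.98 goes into the \((12.5, 13.5]\) bin, and a boundary value such as 12.50 goes into the \((11.5, 12.5]\) bin because each bin includes its right endpoint.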

  3. Plot bars for each bin, with the height of the bar corresponding to the number of runners in that bin
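The three steps above can be sketched in base R. This is a minimal illustration rather than the code behind the slides: it uses only the first twelve personal bests from the listing, and the names `times`, `bin_edges`, `bin_counts`, and `hist_counts` are chosen here for illustration.

```r
# First twelve personal bests (hours) from the listing above
times <- c(14.00, 7.60, 14.20, 14.33, 17.00, 12.00,
           16.00, 16.16, 9.95, 17.55, 12.50, 23.00)

# Step 1: one-hour bins centered at the integers (edges at 5.5, 6.5, ..., 25.5)
bin_edges <- seq(5.5, 25.5, by = 1)

# Step 2: assign each time to a bin; right = TRUE gives intervals of the form (a, b]
bins <- cut(times, breaks = bin_edges, right = TRUE)

# Step 3: the bar heights are the counts per bin
bin_counts <- table(bins)
print(bin_counts[bin_counts > 0])

# Cross-check: hist() also uses (a, b] bins by default and should give the same counts
hist_counts <- hist(times, breaks = bin_edges, plot = FALSE)$counts
```

In this subset, 12.00 and 12.50 both land in the (11.5, 12.5] bin, since each bin includes its right endpoint. Passing `bin_counts` to `barplot()` would draw the corresponding bars.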

library(ggplot2)

# One-hour bins centered at the integers: (9.5, 10.5], (10.5, 11.5], ...
ggplot(ultrarunning) +
  geom_histogram(aes(x = pb100k_dec), binwidth = 1, center = 10, fill = "grey", color = "black") +
  labs(x = "Personal best time (hours)",
       y = "Count") +
  theme(text = element_text(size = 24))

Bin width of 10 hours – too large

Bin width of 3 minutes – too small
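A rough count of bins helps explain why these two widths fail. The sketch below takes the minimum (6.50) and maximum (25.25) personal bests in the data and estimates how many bins are needed to span that range. The exact number drawn by a plotting function can differ slightly depending on where the bins are centered, and the variable names are chosen here for illustration.

```r
# Range of the data in minutes: (25.25 - 6.50) hours * 60
range_minutes <- (25.25 - 6.50) * 60

# Candidate bin widths in minutes: 10 hours, 1 hour, 3 minutes
bin_width_minutes <- c("10 hours" = 600, "1 hour" = 60, "3 minutes" = 3)

# Approximate number of bins needed to span the range
n_bins <- ceiling(range_minutes / bin_width_minutes)
print(n_bins)
```

With 10-hour bins, all 288 runners are squeezed into about two bars, which hides the shape of the distribution. With 3-minute bins there are more bins (about 375) than runners, so many bars are empty or hold a single runner and the plot mostly shows noise.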

Boxplots

Find the five-number summary: minimum, lower quartile, median, upper quartile, maximum

quantile(ultrarunning$pb100k_dec)
     0%     25%     50%     75%    100% 
 6.5000 12.2575 14.2050 16.7500 25.2500 

IQR (interquartile range) = upper quartile - lower quartile
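As a worked example, the quartiles printed above give the IQR for the ultra-runner data. The sketch also computes the 1.5 × IQR fences. That rule is an addition here, not something stated on the slide: it is the conventional (Tukey) rule that `geom_boxplot()` uses by default to decide how far the whiskers extend.

```r
# Five-number summary copied from the quantile() output above
five_num <- c(min = 6.50, q1 = 12.2575, median = 14.205, q3 = 16.75, max = 25.25)

# IQR = upper quartile - lower quartile
iqr <- five_num[["q3"]] - five_num[["q1"]]
print(iqr)  # 4.4925

# Assumption: points more than 1.5 * IQR beyond the box are drawn individually as outliers
lower_fence <- five_num[["q1"]] - 1.5 * iqr
upper_fence <- five_num[["q3"]] + 1.5 * iqr
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
```

Under this rule the lower fence (about 5.52 hours) is below the fastest time of 6.50 hours, while the upper fence (about 23.49 hours) is below the slowest time of 25.25 hours, so only the slowest runners would appear as individual points.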

ggplot(ultrarunning) +
  geom_boxplot(aes(x = pb100k_dec, y = 1)) +
  labs(x = "Personal best time (hours)") +
  # Hide the y-axis, which carries no information for a single boxplot
  scale_y_continuous(breaks = NULL, name = NULL, limits = c(0, 2)) +
  theme(text = element_text(size = 24))

Example: Hodgkin Lymphoma

  • Cancer of the lymphatic system
  • Occurs most often in young adults (ages 20-29) and the elderly (ages 75-84)

HL age of diagnosis in UK females

Potentially misleading conclusion when looking at boxplot alone

Boxplots do not show multimodality
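A small made-up example shows how this can happen. The ages below are not the UK diagnosis data. They are a hypothetical sample with ten cases at every age in the two peak ranges named earlier (20-29 and 75-84), built only to show that a five-number summary can look unremarkable while the data have two separate modes.

```r
# Hypothetical ages: 10 cases at each age from 20-29 and from 75-84 (not real data)
ages <- c(rep(20:29, each = 10), rep(75:84, each = 10))

# The five-number summary that a boxplot would draw
print(quantile(ages))

# Number of observations strictly between the two groups
n_between_modes <- sum(ages > 29 & ages < 75)
print(n_between_modes)  # 0

# Counts in 10-year bins, as a histogram would show them: two peaks with a gap between
age_counts <- table(cut(ages, breaks = seq(20, 90, by = 10), right = FALSE))
print(age_counts)
```

The median of this sample is 52, an age that no one in the sample has. A boxplot would draw its center line there, whereas the binned counts make the two groups visible.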

Barcharts

library(dplyr)

# Recode the numeric surface code as a descriptive label in a new variable
ultrarunning <-
  ultrarunning %>%
  mutate(pb_surface_name = 
           case_when(
             pb_surface == 1 ~ "trail",
             pb_surface == 2 ~ "track",
             pb_surface == 3 ~ "road",
             pb_surface == 4 ~ "mix of all three"
           ))

ggplot(ultrarunning) +
  geom_bar(aes(x = pb_surface_name)) +
  labs(x = "Surface type",
       y = "Count") +
  theme(text = element_text(size = 24))
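The bar heights in a barchart are simply the number of observations in each category. The sketch below shows that counting step in base R on a few hypothetical surface codes. The codes are invented for illustration; only the code-to-label mapping comes from the recoding step above.

```r
# Hypothetical surface codes for seven runners (not taken from the dataset)
pb_surface <- c(1, 3, 3, 2, 1, 4, 3)

# Same code-to-label mapping as in the recoding step
surface_name <- factor(pb_surface,
                       levels = 1:4,
                       labels = c("trail", "track", "road", "mix of all three"))

# One count per category; these are the heights a barchart would draw
surface_counts <- table(surface_name)
print(surface_counts)
```

Unlike histogram bins, these categories have no numeric width or inherent order, so the order of the bars is a presentation choice.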

Comparison to histograms

  • Barcharts are used for categorical data
  • Histograms are used for numeric data, which must first be grouped into bins

References

Samtleben, E. (2023). Ultrarunning dataset. Teaching of Statistics in the Health Sciences Resource Portal. Available at https://www.causeweb.org/tshs/ultra-running/.